SIMILARITY ALGORITHMS FOR FUZZY JOIN COMPUTATION IN BIG DATA PROCESSING ENVIRONMENT

Authors

Abstract

Big data processing is attracting the interest of many researchers who want to process large-scale datasets and extract useful information to support decision making. One of the biggest challenges is the problem of querying large datasets. It becomes even more complicated with similarity queries instead of exact match queries. A fuzzy join is a typical operation frequently used in big data analysis. Currently, there is very little research on this issue, which poses significant barriers to efforts at improving query operations on big data efficiently. As a result, this study overviews the similarity algorithms for fuzzy joins, in which the data at the key attributes may have slight differences within a fuzzy threshold. We analyze six similarity algorithms, namely Hamming, Levenshtein, LCS, Jaccard, Jaro, and Jaro-Winkler, to show the difference between these algorithms through three criteria: output enrichment, false positives/negatives, and the processing time of the algorithms. Experiments with fuzzy joins are implemented in the Spark environment, a popular big data processing platform. The algorithms are divided into two groups for evaluation: group 1 (Hamming, Levenshtein, and LCS) and group 2 (Jaccard, Jaro, and Jaro-Winkler). For the former, Levenshtein has an advantage over the other two algorithms in terms of output enrichment, high accuracy in the result set (false positives/negatives), and acceptable processing time. In the latter, Jaccard is considered the worst algorithm on all three criteria, while Jaro-Winkler has more output richness and higher accuracy in the result set. The overview of the similarity algorithms in this study will help users choose the most suitable algorithm for their problems.
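The idea of a fuzzy join within a threshold can be illustrated with a minimal plain-Python sketch. It is not the authors' Spark implementation: the function names (hamming, levenshtein, jaccard, fuzzy_join), the use of character bigrams for Jaccard, and the nested-loop join are all assumptions made for illustration, and only three of the six algorithms are shown.

```python
from typing import Callable, Iterable, List, Tuple


def hamming(a: str, b: str) -> int:
    """Count differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over character n-gram sets (n-grams are an assumption)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have n-grams count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def fuzzy_join(
    left: Iterable[str],
    right: Iterable[str],
    distance: Callable[[str, str], int],
    threshold: int,
) -> List[Tuple[str, str]]:
    """Naive nested-loop join: keep key pairs whose distance is within threshold."""
    right_keys = list(right)
    return [(l, r) for l in left for r in right_keys if distance(l, r) <= threshold]


pairs = fuzzy_join(["Jon Smith"], ["John Smith", "Jane Doe"], levenshtein, 1)
print(pairs)
```

The nested loop compares every pair of keys, so its cost grows with the product of the two input sizes; a distributed platform such as Spark would normally partition or filter candidates before applying the similarity function.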


Similar resources

Efficient Join Algorithms for Integrating XML Data in Grid Environment

Because of its self-describing nature, XML can be used to represent information in a grid environment. Querying XML data distributed in a grid environment brings new challenges. In this paper, we focus on join algorithms in the result merge step of query processing. In order to transmit results efficiently, we present strategies for data compacting, as well as 4 join operator models. Based on the compacted d...


A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection

Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....


Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming

The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations, because the rapid development in the use of information technology in general, and network technology in particular, has led many organizations to make their applications available for use via electronic platforms hos...


Fast similarity join for multi-dimensional data

To appear in Information Systems Journal, Elsevier, 2005. The efficient processing of multidimensional similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memor...



Journal

Journal title: Journal of Computer Science and Cybernetics

Year: 2022

ISSN: 1813-9663

DOI: https://doi.org/10.15625/1813-9663/17589